Diamonds Price Dataset

Diamonds.Prices2022 = read.csv("/Users/noamgonen/Desktop/pstat 126/Diamonds_Prices2022.csv", header = TRUE)
head(Diamonds.Prices2022)
##   X carat       cut color clarity depth table price    x    y    z
## 1 1  0.23     Ideal     E     SI2  61.5    55   326 3.95 3.98 2.43
## 2 2  0.21   Premium     E     SI1  59.8    61   326 3.89 3.84 2.31
## 3 3  0.23      Good     E     VS1  56.9    65   327 4.05 4.07 2.31
## 4 4  0.29   Premium     I     VS2  62.4    58   334 4.20 4.23 2.63
## 5 5  0.31      Good     J     SI2  63.3    58   335 4.34 4.35 2.75
## 6 6  0.24 Very Good     J    VVS2  62.8    57   336 3.94 3.96 2.48

Above I have printed the first 6 observations in the data. To get a better understanding of the data, I have described each variable; here is what I have gathered:

Data Description and Descriptive Statistics

Summary of the data statistics:

library(skimr)
skim(Diamonds.Prices2022)
Data summary
Name Diamonds.Prices2022
Number of rows 53943
Number of columns 11
_______________________
Column type frequency:
character 3
numeric 8
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
cut 0 1 4 9 0 5 0
color 0 1 1 1 0 7 0
clarity 0 1 2 4 0 8 0

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
X 0 1 26972.00 15572.15 1.0 13486.50 26972.00 40457.50 53943.00 ▇▇▇▇▇
carat 0 1 0.80 0.47 0.2 0.40 0.70 1.04 5.01 ▇▂▁▁▁
depth 0 1 61.75 1.43 43.0 61.00 61.80 62.50 79.00 ▁▁▇▁▁
table 0 1 57.46 2.23 43.0 56.00 57.00 59.00 95.00 ▁▇▁▁▁
price 0 1 3932.73 3989.34 326.0 950.00 2401.00 5324.00 18823.00 ▇▂▁▁▁
x 0 1 5.73 1.12 0.0 4.71 5.70 6.54 10.74 ▁▁▇▃▁
y 0 1 5.73 1.14 0.0 4.72 5.71 6.54 58.90 ▇▁▁▁▁
z 0 1 3.54 0.71 0.0 2.91 3.53 4.04 31.80 ▇▁▁▁▁

Summary of the new data after selecting 500 random observations:

set.seed(123)
library(dplyr)  # for %>%, sample_n(), and select()
DiamondData <- Diamonds.Prices2022 %>% sample_n(500) %>% unique() %>% select(-X)
attach(DiamondData)
skim(DiamondData)
Data summary
Name DiamondData
Number of rows 500
Number of columns 10
_______________________
Column type frequency:
character 3
numeric 7
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
cut 0 1 4 9 0 5 0
color 0 1 1 1 0 7 0
clarity 0 1 2 4 0 8 0

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
carat 0 1 0.80 0.46 0.23 0.38 0.71 1.05 2.30 ▇▆▂▂▁
depth 0 1 61.83 1.36 57.80 61.00 61.90 62.50 67.70 ▁▆▇▁▁
table 0 1 57.38 2.21 52.00 56.00 57.00 59.00 66.00 ▁▇▅▁▁
price 0 1 3922.04 3970.49 368.00 956.75 2537.00 5283.75 18118.00 ▇▃▁▁▁
x 0 1 5.73 1.12 3.93 4.65 5.73 6.56 8.43 ▇▅▇▅▂
y 0 1 5.73 1.11 3.96 4.64 5.72 6.55 8.46 ▇▅▇▃▂
z 0 1 3.54 0.69 2.38 2.85 3.53 4.03 5.28 ▇▆▇▃▂

Here we can see we have 500 rows (observations) and 10 columns (variables): price is the response, the other 9 variables are candidates for my regression model, and the X observation-number column has been dropped. 3 of the variables are character types and 7 are numeric. The three character variables are cut, color, and clarity; none have missing values, and cut has 5 unique categories, clarity has 8, and color has 7. For the numeric variables we can see their means, standard deviations, quantiles, and miniature histograms, which I summarize as follows:

Carat: no missing observations, mean: 0.80, sd: 0.46, median: 0.71, range: (0.23, 2.30), the distribution is skewed right, with large outliers.

Price: no missing observations, mean: 3922.04, sd: 3970.49, median: 2537.00, range: (368.0, 18118.0), the distribution is right skewed.

Depth: no missing observations, mean: 61.83, sd: 1.36, median: 61.90, range: (57.80, 67.70), the distribution shows small and large outliers, but the data is centered in the middle.

Table: no missing observations, mean: 57.38, sd: 2.21, median: 57.00, range: (52.00, 66.00), the distribution is skewed right with a large outlier.

X: no missing observations, mean: 5.73, sd: 1.12, median: 5.73, range: (3.93, 8.43), data is spread out with no outliers.

Y: no missing observations, mean: 5.73, sd: 1.11, median: 5.72, range: (3.96, 8.46), data is spread out with no outliers.

Z: no missing observations, mean: 3.54, sd: 0.69, median: 3.53, range: (2.38, 5.28), data is spread out with no outliers.

From the two summaries above, I can see that the distributions of the numerical variables weren’t affected by the sampling: the shapes stayed the same. The only noticeable change was in the x, y, and z variables, but that is probably due to the loss of outliers.

library(ggplot2)
ggplot(Diamonds.Prices2022, aes(x=color)) + geom_bar(fill='red') + labs(x='Color')

ggplot(DiamondData, aes(x=color)) + geom_bar(fill='blue') + labs(x='Color')

ggplot(Diamonds.Prices2022, aes(x=cut)) + geom_bar(fill='red') + labs(x='Cut')

ggplot(DiamondData, aes(x=cut)) + geom_bar(fill='blue') + labs(x='Cut')

ggplot(Diamonds.Prices2022, aes(x=clarity)) + geom_bar(fill='red') + labs(x='Clarity')

ggplot(DiamondData, aes(x=clarity)) + geom_bar(fill='blue') + labs(x='Clarity')

The population data distributions (red) and the sample data distributions (blue) look nearly identical for all three of the categorical variables. Overall I would say I was able to get a representative sample.

A Full Model Before Any Further Analysis of The Data:

\(price = \beta_0 + \beta_1 carat + \beta_2 cut + \beta_3 color + \beta_4 clarity + \beta_5 depth + \beta_6 table + \beta_7 x + \beta_8 y + \beta_9 z + \epsilon\)

Summary of Regression Model

lm1 = lm(price ~ carat + table + depth + cut + color + clarity + x + y + z)
summary(lm1)
## 
## Call:
## lm(formula = price ~ carat + table + depth + cut + color + clarity + 
##     x + y + z)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -4166.1  -559.7  -164.5   383.5  5285.4 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -8797.41   13449.58  -0.654   0.5134    
## carat        13777.32     573.02  24.043  < 2e-16 ***
## table          -50.43      30.63  -1.646   0.1004    
## depth          195.40     211.10   0.926   0.3551    
## cutGood        613.22     357.50   1.715   0.0869 .  
## cutIdeal       824.32     369.45   2.231   0.0261 *  
## cutPremium     655.62     348.33   1.882   0.0604 .  
## cutVery Good   885.34     351.54   2.518   0.0121 *  
## colorE        -196.33     190.13  -1.033   0.3023    
## colorF        -175.35     187.31  -0.936   0.3497    
## colorG        -438.40     187.33  -2.340   0.0197 *  
## colorH       -1038.16     193.84  -5.356 1.33e-07 ***
## colorI       -1468.93     230.78  -6.365 4.59e-10 ***
## colorJ       -2188.73     265.30  -8.250 1.56e-15 ***
## clarityIF     4604.14     436.94  10.537  < 2e-16 ***
## claritySI1    3463.71     374.63   9.246  < 2e-16 ***
## claritySI2    2473.12     381.44   6.484 2.24e-10 ***
## clarityVS1    4373.50     388.14  11.268  < 2e-16 ***
## clarityVS2    4373.75     384.18  11.385  < 2e-16 ***
## clarityVVS1   4774.12     426.40  11.196  < 2e-16 ***
## clarityVVS2   4765.38     408.61  11.663  < 2e-16 ***
## x             -874.30    1552.45  -0.563   0.5736    
## y             1803.14    1591.29   1.133   0.2577    
## z            -4735.42    3392.24  -1.396   0.1634    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1078 on 476 degrees of freedom
## Multiple R-squared:  0.9296, Adjusted R-squared:  0.9262 
## F-statistic: 273.4 on 23 and 476 DF,  p-value: < 2.2e-16

Correlation Plots to Check for Overfitting and Multicollinearity

plot(DiamondData)

From the above scatterplot matrix, I can see that the variables x, y, z, and carat have strong positive linear relationships. To check this further, I will calculate the correlation of each pair; a correlation near 1 indicates a nearly perfect positive linear relationship.

library(corrplot)  # for corrplot()
DiamondData2 <- subset(DiamondData, select = c("price", "carat", "depth", "table", "x", "y", "z"))
corrplot(cor(DiamondData2), method = "number")

Above we can see the correlations between all the variables and, as expected, price, carat, x, y, and z have very strong pairwise correlations, whereas table and depth do not.

Because carat, x, y, and z are so strongly correlated, only one of them should be used in my regression model; including them all is redundant and causes multicollinearity.
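One standard way to quantify this redundancy is the variance inflation factor (VIF). Below is a minimal sketch on simulated data (the variable names `len`, `wid`, and `tab` are made up for illustration); with the `car` package installed, `car::vif()` computes the same quantity directly from a fitted model:

```r
# VIF for predictor j is 1 / (1 - R^2_j), where R^2_j comes from
# regressing predictor j on all the other predictors.
set.seed(1)
len <- rnorm(200, mean = 5.7, sd = 1.1)   # stand-in for x (length)
wid <- len + rnorm(200, sd = 0.05)        # nearly collinear with len
tab <- rnorm(200, mean = 57, sd = 2.2)    # roughly independent of both

vif <- function(j, preds) {
  r2 <- summary(lm(preds[[j]] ~ ., data = preds[-j]))$r.squared
  1 / (1 - r2)
}

preds <- data.frame(len, wid, tab)
vif_len <- vif(1, preds)   # very large: len is almost fully explained by wid
vif_tab <- vif(3, preds)   # near 1: tab is unrelated to the others
```

A common rule of thumb flags VIF values above 5 or 10 as problematic, which is exactly what carat, x, y, and z would show here.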

Linear Regression Models

The next step is to conduct significance tests at \(\alpha = 0.05\) to determine which variables are significantly related to the response variable, price. To do this I will test the significance of the slope in the simple linear regression of price on each numerical predictor, using the hypotheses:

\(H_0: \beta_1 = 0\)

\(H_1: \beta_1 \neq 0\)
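These tests use the t-statistic that `summary()` reports for each slope. As a sketch of how the slope's t-value and p-value can be pulled out programmatically (shown on R's built-in `mtcars` data rather than the diamond sample):

```r
fit <- lm(mpg ~ wt, data = mtcars)       # a simple linear regression
coef_table <- summary(fit)$coefficients  # Estimate, Std. Error, t value, Pr(>|t|)

t_slope <- coef_table["wt", "t value"]
p_slope <- coef_table["wt", "Pr(>|t|)"]

# Reject H0: beta_1 = 0 at alpha = 0.05 when the p-value falls below 0.05
reject_H0 <- p_slope < 0.05
```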

linear model for price and carat:

lm_carat = lm(price ~ carat)
summary(lm_carat)
## 
## Call:
## lm(formula = price ~ carat)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -5785.6  -850.9     1.8   619.4  7712.1 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -2361.6      139.6  -16.92   <2e-16 ***
## carat         7868.4      151.2   52.04   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1566 on 498 degrees of freedom
## Multiple R-squared:  0.8447, Adjusted R-squared:  0.8443 
## F-statistic:  2708 on 1 and 498 DF,  p-value: < 2.2e-16

The fitted model is described by the following equation:

\(\widehat{price} = -2361.6 + 7868.4\,carat\)

The slope's p-value is less than 0.05, therefore \(H_0\) is rejected, and we conclude that carat has a significant relation to price. The \(R^2_{adj}\) is 0.844, which means about 84% of the variance in the response variable, price, is explained by the model.

linear model for price and depth:

lm_depth = lm(price ~ depth)
summary(lm_depth)
## 
## Call:
## lm(formula = price ~ depth)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
##  -3548  -2958  -1376   1358  14224 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)  2365.29    8091.14   0.292    0.770
## depth          25.18     130.83   0.192    0.847
## 
## Residual standard error: 3974 on 498 degrees of freedom
## Multiple R-squared:  7.436e-05,  Adjusted R-squared:  -0.001934 
## F-statistic: 0.03704 on 1 and 498 DF,  p-value: 0.8475

The fitted model is described by the following equation:

\(\widehat{price} = 2365.29 + 25.18\,depth\)

The slope's p-value is greater than 0.05, therefore we fail to reject \(H_0\), and we conclude that depth does not have a significant relation to price. The \(R^2_{adj}\) is -0.0019, which indicates essentially no explanatory power and is something to consider for my later model.

linear model for price and table:

lm_table = lm(price ~ table)
summary(lm_table)
## 
## Call:
## lm(formula = price ~ table)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
##  -5448  -2763  -1427   1299  14969 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)   
## (Intercept) -9375.51    4574.40  -2.050  0.04093 * 
## table         231.75      79.66   2.909  0.00379 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3941 on 498 degrees of freedom
## Multiple R-squared:  0.01671,    Adjusted R-squared:  0.01474 
## F-statistic: 8.463 on 1 and 498 DF,  p-value: 0.003787

The fitted model is described by the following equation:

\(\widehat{price} = -9375.51 + 231.75\,table\)

The slope's p-value is less than 0.05, therefore \(H_0\) is rejected, and we conclude that table has a significant relation to price. The \(R^2_{adj}\) is 0.0147, which is low, but I don’t think including table would hurt my model.

linear model for price and x (length):

lm_x = lm(price ~ x)
summary(lm_x)
## 
## Call:
## lm(formula = price ~ x)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -4662.8 -1283.9  -215.9   971.3  8662.9 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -13912.77     446.85  -31.14   <2e-16 ***
## x             3111.51      76.52   40.66   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1912 on 498 degrees of freedom
## Multiple R-squared:  0.7685, Adjusted R-squared:  0.7681 
## F-statistic:  1654 on 1 and 498 DF,  p-value: < 2.2e-16

The fitted model is described by the following equation:

\(\widehat{price} = -13912.77 + 3111.51\,x\)

The slope's p-value is less than 0.05, therefore \(H_0\) is rejected, and we conclude that x (length) has a significant relation to price. The \(R^2_{adj}\) is 0.768, which means about 76.8% of the variance in the response variable, price, is explained by the model.

linear model for price and y (width):

lm_y = lm(price ~ y)
summary(lm_y)
## 
## Call:
## lm(formula = price ~ y)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -4596.8 -1313.9  -254.2   986.6  8468.2 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -14083.71     442.05  -31.86   <2e-16 ***
## y             3140.49      75.69   41.49   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1883 on 498 degrees of freedom
## Multiple R-squared:  0.7756, Adjusted R-squared:  0.7752 
## F-statistic:  1722 on 1 and 498 DF,  p-value: < 2.2e-16

The fitted model is described by the following equation:

\(\widehat{price} = -14083.71 + 3140.49\,y\)

The slope's p-value is less than 0.05, therefore \(H_0\) is rejected, and we conclude that y (width) has a significant relation to price. The \(R^2_{adj}\) is 0.7752, which means about 77.52% of the variance in the response variable, price, is explained by the model.

linear model for price and z (height):

lm_z = lm(price ~ z)
summary(lm_z)
## 
## Call:
## lm(formula = price ~ z)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -6444.0 -1280.9  -193.0   914.9  8420.2 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -13743.6      452.1  -30.40   <2e-16 ***
## z             4983.7      125.2   39.82   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1943 on 498 degrees of freedom
## Multiple R-squared:  0.761,  Adjusted R-squared:  0.7605 
## F-statistic:  1585 on 1 and 498 DF,  p-value: < 2.2e-16

The fitted model is described by the following equation:

\(\widehat{price} = -13743.6 + 4983.7\,z\)

The slope's p-value is less than 0.05, therefore \(H_0\) is rejected, and we conclude that z (height) has a significant relation to price. The \(R^2_{adj}\) is 0.7605, which means about 76.05% of the variance in the response variable, price, is explained by the model.

Next I will find a confidence interval for the carat slope in the price ~ carat model.

summary_stats <- summary(lm_carat)
coeffs <- summary_stats$coefficients[, "Estimate"]
standard_error <- summary_stats$coefficients[, "Std. Error"]
t_crit <- qt(1-0.05/2, df = summary_stats$df[2])

lower_bound <- coeffs - t_crit * standard_error
upper_bound <- coeffs + t_crit * standard_error

conf_int <- cbind(lower_bound, upper_bound)
conf_int
##             lower_bound upper_bound
## (Intercept)   -2635.935   -2087.360
## carat          7571.296    8165.467

Based on the results of the above code, I can say I am 95% confident that the true effect of carat on price is between 7571.30 and 8165.47 dollars per carat.
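The manual t-critical construction above is exactly what R's built-in `confint()` does for `lm` fits. A quick self-contained check of the equivalence (on simulated data, since the point is the agreement rather than the numbers):

```r
set.seed(123)
x <- runif(100)
y <- 3 + 5 * x + rnorm(100)
fit <- lm(y ~ x)

s   <- summary(fit)
est <- s$coefficients[, "Estimate"]
se  <- s$coefficients[, "Std. Error"]
tc  <- qt(1 - 0.05/2, df = s$df[2])

manual  <- cbind(est - tc * se, est + tc * se)  # manual 95% CI
builtin <- confint(fit, level = 0.95)           # same interval via confint()
```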

Check Assumption of Linearity, and Check for Constant variance

I further check the numeric variables for linearity and constant variance by examining residuals-versus-fitted plots for each simple linear model.

Carat and Price Residual Fitted Plot

residuals_carat <- residuals(lm_carat)
fitted_carat <- fitted(lm_carat)
plot(fitted_carat, residuals_carat, xlab='Fitted', ylab = 'Residuals', main = 'Carat vs. Price | Residuals vs. Fitted')

This plot is very crowded on the left side and spread out on the right side, which suggests non-constant variance (heteroscedasticity), which is not what we want. To try to fix this, I can transform the model by taking the log of both price and carat.

slr_carat_adj <- lm(log(price)~log(carat))
residuals_carat_adj <- residuals(slr_carat_adj)
fitted_carat_adj <- fitted(slr_carat_adj)
plot(fitted_carat_adj, residuals_carat_adj, xlab='Fitted', ylab = 'Residuals', main = 'Carat vs. Price | Residuals vs. Fitted Adjusted')

After taking the log of both carat and price, the spread of the plot is much more even and shows no pattern, supporting the assumptions of linearity and constant variance.
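The visual improvement can also be checked numerically by comparing residual spread in the lower and upper halves of the fitted values, before and after the log transform. Here is a sketch on simulated data with multiplicative (lognormal) error, which mimics the price–carat funnel pattern (all names and numbers are illustrative, not from the diamond sample):

```r
set.seed(42)
carat_sim <- runif(500, 0.2, 2.3)
price_sim <- exp(8.4 + 1.7 * log(carat_sim) + rnorm(500, sd = 0.25))

spread_ratio <- function(fit) {
  r <- residuals(fit); f <- fitted(fit)
  # ratio >> 1 means residual variance grows with the fitted values
  sd(r[f > median(f)]) / sd(r[f <= median(f)])
}

ratio_raw <- spread_ratio(lm(price_sim ~ carat_sim))            # funnel: large ratio
ratio_log <- spread_ratio(lm(log(price_sim) ~ log(carat_sim)))  # roughly constant: near 1
```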

Depth and Price Residual Fitted Plot

residuals_depth <- residuals(lm_depth)
fitted_depth <- fitted(lm_depth)
plot(fitted_depth, residuals_depth, xlab='Fitted', ylab = 'Residuals', main = 'Depth vs. Price | Residuals vs. Fitted')

This plot is not centered around 0 and not very spread out, which is not what we want. To try to fix this, I will apply the log transformation once again.

slr_depth_adj <- lm(log(price)~log(depth))
residuals_depth_adj <- residuals(slr_depth_adj)
fitted_depth_adj <- fitted(slr_depth_adj)
plot(fitted_depth_adj, residuals_depth_adj, xlab='Fitted', ylab = 'Residuals', main = 'Depth vs. Price | Residuals vs. Fitted Adjusted')

This definitely improved the plot: we can now see roughly constant variance and no clear pattern.

Table and Price Residual Fitted Plot

residuals_table <- residuals(lm_table)
fitted_table <- fitted(lm_table)
plot(fitted_table, residuals_table, xlab='Fitted', ylab = 'Residuals', main = 'Table vs. Price | Residuals vs. Fitted')

This plot shows the points in vertical lines, which is expected here since the table variable takes only whole-number values. I will apply the log transformation to this model as well.

slr_table_adj <- lm(log(price)~log(table))
residuals_table_adj <- residuals(slr_table_adj)
fitted_table_adj <- fitted(slr_table_adj)
plot(fitted_table_adj, residuals_table_adj, xlab='Fitted', ylab = 'Residuals', main = 'Table vs. Price | Residuals vs. Fitted Adjusted')

Now we can see the spread of the plot better, and we notice no pattern, supporting the assumptions of linearity and constant variance.

Length and Price Residual Fitted Plot

residuals_x <- residuals(lm_x)
fitted_x <- fitted(lm_x)
plot(fitted_x, residuals_x, xlab='Fitted', ylab = 'Residuals', main = 'Length vs. Price | Residuals vs. Fitted')

The x (length) variable has the same issue as the carat plot above.

slr_x_adj <- lm(log(price)~x)
residuals_x_adj <- residuals(slr_x_adj)
fitted_x_adj <- fitted(slr_x_adj)
plot(fitted_x_adj, residuals_x_adj, xlab='Fitted', ylab = 'Residuals', main = 'Length vs. Price | Residuals vs. Fitted Adjusted')

However, once transformed, the plot looks much better: it is far more evenly spread, and there is no noticeable pattern, supporting the assumptions of linearity and constant variance.

Width and Price Residual Fitted Plot

residuals_y <- residuals(lm_y)
fitted_y <- fitted(lm_y)
plot(fitted_y, residuals_y, xlab='Fitted', ylab = 'Residuals', main = 'Width vs. Price | Residuals vs. Fitted')

Same issue as above.

slr_y_adj <- lm(log(price)~y)
residuals_y_adj <- residuals(slr_y_adj)
fitted_y_adj <- fitted(slr_y_adj)
plot(fitted_y_adj, residuals_y_adj, xlab='Fitted', ylab = 'Residuals', main = 'Width vs. Price | Residuals vs. Fitted Adjusted')

Same conclusion, supporting the assumptions of linearity and constant variance.

Height and Price Residual Fitted Plot

residuals_z <- residuals(lm_z)
fitted_z <- fitted(lm_z)
plot(fitted_z, residuals_z, xlab='Fitted', ylab = 'Residuals', main = 'Height vs. Price | Residuals vs. Fitted')

Same issue as above.

slr_z_adj <- lm(log(price)~z)
residuals_z_adj <- residuals(slr_z_adj)
fitted_z_adj <- fitted(slr_z_adj)
plot(fitted_z_adj, residuals_z_adj, xlab='Fitted', ylab = 'Residuals', main = 'Height vs. Price | Residuals vs. Fitted Adjusted')

Same conclusion, supporting the assumptions of linearity and constant variance.

Checking for Normality

Next I need to check for normality to make sure the error term is normally distributed.

Carat Normality Plot

qqnorm(residuals_carat_adj)
qqline(residuals_carat_adj)

Depth Normality Plot

qqnorm(residuals_depth_adj)
qqline(residuals_depth_adj)

Table Normality Plot

qqnorm(residuals_table_adj)
qqline(residuals_table_adj)

Length Normality Plot

qqnorm(residuals_x_adj)
qqline(residuals_x_adj)

Width Normality Plot

qqnorm(residuals_y_adj)
qqline(residuals_y_adj)

Height Normality Plot

qqnorm(residuals_z_adj)
qqline(residuals_z_adj)

All the normality plots show a strong linear pattern once the response variable is transformed, which tells me they all satisfy the normality assumption.

After testing all the error assumptions (constant variance, linearity, and normality), all of my variables, after undergoing a transformation, passed, and all are candidates for my final model.
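As a numeric complement to the Q-Q plots, a Shapiro–Wilk test can be run on any of these residual vectors. A sketch on simulated residuals (standing in for, e.g., `residuals_carat_adj`, since the test only needs a numeric vector):

```r
set.seed(7)
resids <- rnorm(300, sd = 0.25)  # stand-in for a model's residual vector

sw <- shapiro.test(resids)
# A large p-value gives no evidence against normality; a small one would
# flag departures that the Q-Q plot might only hint at.
normality_ok <- sw$p.value > 0.05
```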

Summary of Adjusted Variables

summary(slr_carat_adj)
## 
## Call:
## lm(formula = log(price) ~ log(carat))
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.00690 -0.16608  0.00384  0.15534  0.82045 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  8.44370    0.01368  617.30   <2e-16 ***
## log(carat)   1.66870    0.01925   86.67   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.2544 on 498 degrees of freedom
## Multiple R-squared:  0.9378, Adjusted R-squared:  0.9377 
## F-statistic:  7512 on 1 and 498 DF,  p-value: < 2.2e-16
summary(slr_depth_adj)
## 
## Call:
## lm(formula = log(price) ~ log(depth))
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.87309 -0.91927  0.06128  0.79933  2.04127 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)    2.803      8.599   0.326    0.745
## log(depth)     1.208      2.085   0.579    0.563
## 
## Residual standard error: 1.02 on 498 degrees of freedom
## Multiple R-squared:  0.0006736,  Adjusted R-squared:  -0.001333 
## F-statistic: 0.3357 on 1 and 498 DF,  p-value: 0.5626
summary(slr_table_adj)
## 
## Call:
## lm(formula = log(price) ~ log(table))
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2.43086 -0.88148  0.01703  0.73660  2.36070 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -15.306      4.733  -3.234   0.0013 ** 
## log(table)     5.703      1.169   4.879 1.44e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.9966 on 498 degrees of freedom
## Multiple R-squared:  0.04562,    Adjusted R-squared:  0.0437 
## F-statistic:  23.8 on 1 and 498 DF,  p-value: 1.439e-06
summary(slr_x_adj)
## 
## Call:
## lm(formula = log(price) ~ x)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.82411 -0.17176  0.01574  0.16500  0.85079 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  2.74779    0.06276   43.78   <2e-16 ***
## x            0.87888    0.01075   81.77   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.2686 on 498 degrees of freedom
## Multiple R-squared:  0.9307, Adjusted R-squared:  0.9305 
## F-statistic:  6687 on 1 and 498 DF,  p-value: < 2.2e-16
summary(slr_y_adj)
## 
## Call:
## lm(formula = log(price) ~ y)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.80200 -0.17789  0.01491  0.16425  0.80330 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  2.71616    0.06194   43.85   <2e-16 ***
## y            0.88416    0.01061   83.36   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.2638 on 498 degrees of freedom
## Multiple R-squared:  0.9331, Adjusted R-squared:  0.933 
## F-statistic:  6950 on 1 and 498 DF,  p-value: < 2.2e-16
summary(slr_z_adj)
## 
## Call:
## lm(formula = log(price) ~ z)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.32813 -0.18182  0.00425  0.17763  0.88272 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   2.7931     0.0661   42.26   <2e-16 ***
## z             1.4084     0.0183   76.97   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.2841 on 498 degrees of freedom
## Multiple R-squared:  0.9225, Adjusted R-squared:  0.9223 
## F-statistic:  5924 on 1 and 498 DF,  p-value: < 2.2e-16

Here we can see how the significance and \(R^2_{adj}\) values have changed. Depth, even though it now fits the error assumptions, still does not have a significant effect on price, so I will rule it out. Table’s significance has grown and its \(R^2_{adj}\) increased, which is good for my model.
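Comparisons like this are easier to script than to read off six separate summaries. A sketch of pulling \(R^2_{adj}\) out of several fits at once (using `mtcars` as a stand-in dataset; the model names are illustrative):

```r
fits <- list(
  wt_only    = lm(mpg ~ wt, data = mtcars),
  wt_hp      = lm(mpg ~ wt + hp, data = mtcars),
  wt_hp_qsec = lm(mpg ~ wt + hp + qsec, data = mtcars)
)

# adj.r.squared lives inside the summary object of each fit
adj_r2 <- sapply(fits, function(f) summary(f)$adj.r.squared)
sort(adj_r2, decreasing = TRUE)  # rank the candidate models
```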

Dummy Variables

linear model for price and cut:

lm_cut = lm(log(price) ~ log(carat) + cut)
summary(lm_cut)
## 
## Call:
## lm(formula = log(price) ~ log(carat) + cut)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.72340 -0.16972  0.00027  0.15934  0.76215 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   8.14488    0.06188 131.634  < 2e-16 ***
## log(carat)    1.69512    0.01963  86.343  < 2e-16 ***
## cutGood       0.29396    0.07285   4.035 6.32e-05 ***
## cutIdeal      0.35485    0.06532   5.432 8.74e-08 ***
## cutPremium    0.27820    0.06586   4.224 2.86e-05 ***
## cutVery Good  0.31075    0.06656   4.668 3.92e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.2475 on 494 degrees of freedom
## Multiple R-squared:  0.9416, Adjusted R-squared:  0.941 
## F-statistic:  1593 on 5 and 494 DF,  p-value: < 2.2e-16

The fitted model is described by the following equation:

\(\widehat{log(price)} = 8.14488 + 1.69512\,log(carat) + 0.29396\,cutGood + 0.35485\,cutIdeal + 0.27820\,cutPremium + 0.31075\,cutVeryGood\)

Every cut coefficient has a p-value less than 0.05, therefore \(H_0\) is rejected, and we conclude that cut has a significant relation to price.

linear model for price and color:

lm_color = lm(log(price) ~ log(carat) + color)
summary(lm_color)
## 
## Call:
## lm(formula = log(price) ~ log(carat) + color)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1.0590 -0.1305 -0.0025  0.1373  0.6669 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  8.59395    0.03319 258.930  < 2e-16 ***
## log(carat)   1.70755    0.01812  94.242  < 2e-16 ***
## colorE      -0.08220    0.04028  -2.041   0.0418 *  
## colorF      -0.09463    0.03952  -2.395   0.0170 *  
## colorG      -0.07312    0.03908  -1.871   0.0619 .  
## colorH      -0.22933    0.04042  -5.674 2.39e-08 ***
## colorI      -0.26524    0.04842  -5.478 6.87e-08 ***
## colorJ      -0.45093    0.05590  -8.067 5.55e-15 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.2325 on 492 degrees of freedom
## Multiple R-squared:  0.9487, Adjusted R-squared:  0.948 
## F-statistic:  1299 on 7 and 492 DF,  p-value: < 2.2e-16

The fitted model is described by the following equation:

\(\widehat{log(price)} = 8.59395 + 1.70755\,log(carat) - 0.08220\,colorE - 0.09463\,colorF - 0.07312\,colorG - 0.22933\,colorH - 0.26524\,colorI - 0.45093\,colorJ\)

All color coefficients except colorG (p = 0.0619) have p-values below 0.05, therefore \(H_0\) is rejected, and we conclude that color has a significant relation to price.

linear model for price and clarity:

lm_clarity = lm(log(price) ~ log(carat) + clarity)
summary(lm_clarity)
## 
## Call:
## lm(formula = log(price) ~ log(carat) + clarity)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.56731 -0.09920  0.00946  0.12845  0.54031 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  7.77622    0.05846  133.02  < 2e-16 ***
## log(carat)   1.80368    0.01545  116.76  < 2e-16 ***
## clarityIF    1.04194    0.07097   14.68  < 2e-16 ***
## claritySI1   0.63001    0.06099   10.33  < 2e-16 ***
## claritySI2   0.48428    0.06209    7.80 3.76e-14 ***
## clarityVS1   0.82233    0.06293   13.07  < 2e-16 ***
## clarityVS2   0.75568    0.06204   12.18  < 2e-16 ***
## clarityVVS1  1.01884    0.06929   14.70  < 2e-16 ***
## clarityVVS2  0.93599    0.06636   14.10  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1846 on 491 degrees of freedom
## Multiple R-squared:  0.9677, Adjusted R-squared:  0.9672 
## F-statistic:  1840 on 8 and 491 DF,  p-value: < 2.2e-16

The fitted model is described by the following equation:

\(\widehat{log(price)} = 7.77622 + 1.80368\,log(carat) + 1.04194\,clarityIF + 0.63001\,claritySI1 + 0.48428\,claritySI2 + 0.82233\,clarityVS1 + 0.75568\,clarityVS2 + 1.01884\,clarityVVS1 + 0.93599\,clarityVVS2\)

Every clarity coefficient has a p-value less than 0.05, therefore \(H_0\) is rejected, and we conclude that clarity has a significant relation to price.

Overall I would say color and clarity are the best categorical variables, as they have the smallest p-values and the highest \(R^2_{adj}\); however, cut is also a good variable for the model.

Adding to the Carat Model

The \(R^2_{adj}\) value for the transformed carat-price model is 0.9377. I will now test how this value changes as I add more variables to the model. I already saw the effect of adding the categorical variables: \(R^2_{adj}\) went up to 0.941 with cut, 0.948 with color, and 0.967 with clarity.

Adding just the variable table increased \(R^2_{adj}\) to 0.9382, adding just x increased it to 0.9382, just y increased it to 0.9391, and z left it essentially unchanged at 0.9377. Surprisingly, adding depth increased the \(R^2_{adj}\) value the most.

Adding depth and table to the price ~ carat model increased it to 0.9405; adding x, y, or z to that model did not increase the \(R^2_{adj}\) value, so those get omitted.

Next let’s try adding the categorical variables to the price~ carat + depth + table model.

Adding x and y to the price ~ carat model each improved my \(R^2\) value; however, adding z did not, which tells me there is multicollinearity between z and carat. Adding cut increased \(R^2_{adj}\) to 0.9416, clarity increased it to 0.9684, and including color increases it even higher, to 0.985.

However, when running the summary I noticed that table and depth were not significant in my model, so I removed them and instead added z. This gave me a model with an \(R^2_{adj}\) of 0.9853.

Part 3

After completing the tests above, I have decided this is the best model:

\[ log(price) = \beta_0 + \beta_1log(carat) + \beta_2color + \beta_3clarity + \beta_4cut + \beta_5z + \epsilon \]

lm2 = lm(log(price) ~ log(carat) + z + cut + clarity + color)
summary(lm2)
## 
## Call:
## lm(formula = log(price) ~ log(carat) + z + cut + clarity + color)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.37790 -0.08161 -0.00223  0.08039  0.34837 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   6.79092    0.30134  22.536  < 2e-16 ***
## log(carat)    1.59796    0.08492  18.817  < 2e-16 ***
## z             0.24569    0.07248   3.390 0.000757 ***
## cutGood       0.13077    0.03837   3.408 0.000710 ***
## cutIdeal      0.20503    0.03462   5.922 6.05e-09 ***
## cutPremium    0.16490    0.03520   4.685 3.66e-06 ***
## cutVery Good  0.17011    0.03501   4.859 1.60e-06 ***
## clarityIF     1.10852    0.04963  22.335  < 2e-16 ***
## claritySI1    0.65966    0.04291  15.374  < 2e-16 ***
## claritySI2    0.48112    0.04361  11.033  < 2e-16 ***
## clarityVS1    0.87187    0.04421  19.720  < 2e-16 ***
## clarityVS2    0.80088    0.04375  18.308  < 2e-16 ***
## clarityVVS1   1.08047    0.04845  22.302  < 2e-16 ***
## clarityVVS2   0.96875    0.04642  20.867  < 2e-16 ***
## colorE       -0.06388    0.02166  -2.950 0.003336 ** 
## colorF       -0.11449    0.02133  -5.368 1.24e-07 ***
## colorG       -0.14506    0.02129  -6.815 2.84e-11 ***
## colorH       -0.27813    0.02198 -12.655  < 2e-16 ***
## colorI       -0.39323    0.02632 -14.941  < 2e-16 ***
## colorJ       -0.55634    0.03025 -18.389  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1234 on 480 degrees of freedom
## Multiple R-squared:  0.9859, Adjusted R-squared:  0.9853 
## F-statistic:  1766 on 19 and 480 DF,  p-value: < 2.2e-16

Transforming the response by taking the log of price, along with the log of carat, increased the \(R^2_{adj}\) value, so I will continue to use this transformation. I also noticed that including table and depth did not help my model: their weak correlation with price only added noise and risked overfitting. Likewise, x and y were not helpful because their strong correlation with carat introduced multicollinearity.
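One quick way to see why the log transform helps (a diagnostic sketch, not from the original analysis) is to compare the residual spread before and after transforming:

```r
# Compare residuals-vs-fitted for the untransformed and log-log models.
# Assumes DiamondData exists as created above.
raw_fit <- lm(price ~ carat, data = DiamondData)
log_fit <- lm(log(price) ~ log(carat), data = DiamondData)
par(mfrow = c(1, 2))
plot(fitted(raw_fit), resid(raw_fit), main = "price ~ carat",
     xlab = "Fitted", ylab = "Residuals")   # fan shape: variance grows with fit
plot(fitted(log_fit), resid(log_fit), main = "log(price) ~ log(carat)",
     xlab = "Fitted", ylab = "Residuals")   # roughly constant spread
```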

AIC

Now I will create a model using criterion-based methods to select the variables, starting with the Akaike Information Criterion (AIC). Below I have generated models using three different search strategies: backward elimination, forward selection, and stepwise regression.
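For linear models, step() ranks candidates by the AIC that extractAIC() computes, which for a model with residual sum of squares RSS and \(p\) estimated parameters is (up to an additive constant):

\[ AIC = n \log\left(\frac{RSS}{n}\right) + 2p \]

Lower is better; the \(2p\) term penalizes each additional predictor, so a variable is only kept if it reduces the RSS enough to pay for itself.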

AIC Backward Elimination

full_model <- lm(log(price)~., DiamondData)
step(full_model, direction='backward')
## Start:  AIC=-2059.6
## log(price) ~ carat + cut + color + clarity + depth + table + 
##     x + y + z
## 
##           Df Sum of Sq     RSS     AIC
## - y        1    0.0078  7.3924 -2061.1
## - depth    1    0.0121  7.3968 -2060.8
## <none>                  7.3846 -2059.6
## - z        1    0.0623  7.4469 -2057.4
## - table    1    0.0955  7.4801 -2055.2
## - cut      4    0.3347  7.7193 -2045.4
## - x        1    0.2827  7.6673 -2042.8
## - carat    1    3.2543 10.6390 -1879.0
## - color    6    9.1766 16.5613 -1667.8
## - clarity  7   17.0052 24.3898 -1476.2
## 
## Step:  AIC=-2061.08
## log(price) ~ carat + cut + color + clarity + depth + table + 
##     x + z
## 
##           Df Sum of Sq     RSS     AIC
## - depth    1    0.0047  7.3972 -2062.8
## <none>                  7.3924 -2061.1
## - table    1    0.0951  7.4875 -2056.7
## - z        1    0.1730  7.5654 -2051.5
## - cut      4    0.3908  7.7832 -2043.3
## - x        1    0.2987  7.6911 -2043.3
## - carat    1    3.2532 10.6457 -1880.7
## - color    6    9.1774 16.5699 -1669.5
## - clarity  7   17.2383 24.6308 -1473.3
## 
## Step:  AIC=-2062.76
## log(price) ~ carat + cut + color + clarity + table + x + z
## 
##           Df Sum of Sq     RSS     AIC
## <none>                  7.3972 -2062.8
## - table    1    0.0904  7.4876 -2058.7
## - cut      4    0.4016  7.7987 -2044.3
## - x        1    2.3920  9.7891 -1924.7
## - z        1    2.5299  9.9270 -1917.7
## - carat    1    3.2959 10.6931 -1880.5
## - color    6    9.1986 16.5957 -1670.7
## - clarity  7   17.7297 25.1269 -1465.3
## 
## Call:
## lm(formula = log(price) ~ carat + cut + color + clarity + table + 
##     x + z, data = DiamondData)
## 
## Coefficients:
##  (Intercept)         carat       cutGood      cutIdeal    cutPremium  
##     -0.65205      -0.96135       0.12432       0.18682       0.13700  
## cutVery Good        colorE        colorF        colorG        colorH  
##      0.16381      -0.05103      -0.11215      -0.14390      -0.27686  
##       colorI        colorJ     clarityIF    claritySI1    claritySI2  
##     -0.40480      -0.55660       1.17546       0.72515       0.54168  
##   clarityVS1    clarityVS2   clarityVVS1   clarityVVS2         table  
##      0.93156       0.86275       1.13982       1.03087       0.00835  
##            x             z  
##      0.68759       1.12393
backElim_model <- lm(log(price) ~ carat + cut + color + clarity + table + x + z, data = DiamondData)
summary(backElim_model)
## 
## Call:
## lm(formula = log(price) ~ carat + cut + color + clarity + table + 
##     x + z, data = DiamondData)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.37845 -0.08277 -0.00156  0.08560  0.37921 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -0.652054   0.237467  -2.746 0.006263 ** 
## carat        -0.961345   0.065873 -14.594  < 2e-16 ***
## cutGood       0.124323   0.039780   3.125 0.001885 ** 
## cutIdeal      0.186822   0.040829   4.576 6.06e-06 ***
## cutPremium    0.137005   0.039753   3.446 0.000618 ***
## cutVery Good  0.163807   0.038172   4.291 2.15e-05 ***
## colorE       -0.051031   0.021895  -2.331 0.020183 *  
## colorF       -0.112154   0.021535  -5.208 2.84e-07 ***
## colorG       -0.143904   0.021591  -6.665 7.31e-11 ***
## colorH       -0.276855   0.022307 -12.411  < 2e-16 ***
## colorI       -0.404797   0.026539 -15.253  < 2e-16 ***
## colorJ       -0.556602   0.030584 -18.199  < 2e-16 ***
## clarityIF     1.175455   0.050096  23.464  < 2e-16 ***
## claritySI1    0.725152   0.043135  16.811  < 2e-16 ***
## claritySI2    0.541680   0.043956  12.323  < 2e-16 ***
## clarityVS1    0.931561   0.044555  20.908  < 2e-16 ***
## clarityVS2    0.862750   0.044072  19.576  < 2e-16 ***
## clarityVVS1   1.139818   0.048904  23.307  < 2e-16 ***
## clarityVVS2   1.030875   0.046863  21.997  < 2e-16 ***
## table         0.008350   0.003455   2.417 0.016024 *  
## x             0.687591   0.055306  12.433  < 2e-16 ***
## z             1.123926   0.087903  12.786  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1244 on 478 degrees of freedom
## Multiple R-squared:  0.9857, Adjusted R-squared:  0.9851 
## F-statistic:  1572 on 21 and 478 DF,  p-value: < 2.2e-16

Using backward elimination, the model with the lowest AIC removed the predictors depth and y, giving a model with an \(R^2_{adj}\) of 0.9851 and AIC = -2062.76. The model backward elimination suggests is:

\[ log(price) = \beta_0 + \beta_1carat + \beta_2cut + \beta_3color + \beta_4clarity + \beta_5table + \beta_6x + \beta_7z + \epsilon \]

AIC Forward Selection

null_model <- lm(log(price)~1, DiamondData)
step(null_model, direction='forward', scope = formula(full_model))
## Start:  AIC=19.96
## log(price) ~ 1
## 
##           Df Sum of Sq    RSS      AIC
## + y        1    483.63  34.66 -1330.57
## + x        1    482.36  35.92 -1312.62
## + z        1    478.09  40.19 -1256.47
## + carat    1    451.09  67.19  -999.57
## + cut      4     31.36 486.93    -3.25
## + clarity  7     37.02 481.26    -3.10
## + table    1     22.95 495.33    -0.69
## <none>                 518.28    19.96
## + color    6     11.54 506.74    20.69
## + depth    1      0.41 517.87    21.56
## 
## Step:  AIC=-1330.57
## log(price) ~ y
## 
##           Df Sum of Sq    RSS     AIC
## + clarity  7   11.7853 22.871 -1524.4
## + color    6    6.8256 27.830 -1428.2
## + carat    1    2.8617 31.794 -1371.7
## + depth    1    0.1467 34.509 -1330.7
## + z        1    0.1388 34.517 -1330.6
## <none>                 34.656 -1330.6
## + table    1    0.0690 34.587 -1329.6
## + x        1    0.0307 34.625 -1329.0
## + cut      4    0.4300 34.226 -1328.8
## 
## Step:  AIC=-1524.38
## log(price) ~ y + clarity
## 
##         Df Sum of Sq    RSS     AIC
## + color  6    9.7834 13.087 -1791.5
## + carat  1    3.1211 19.749 -1595.7
## + z      1    1.0323 21.838 -1545.5
## + depth  1    0.8721 21.998 -1541.8
## + x      1    0.3296 22.541 -1529.6
## + cut    4    0.3917 22.479 -1525.0
## <none>               22.870 -1524.4
## + table  1    0.0003 22.870 -1522.4
## 
## Step:  AIC=-1791.49
## log(price) ~ y + clarity + color
## 
##         Df Sum of Sq    RSS     AIC
## + z      1   1.81523 11.272 -1864.2
## + carat  1   1.77949 11.308 -1862.6
## + depth  1   1.56548 11.522 -1853.2
## + x      1   0.64313 12.444 -1814.7
## <none>               13.087 -1791.5
## + table  1   0.01984 13.067 -1790.2
## + cut    4   0.14991 12.937 -1789.2
## 
## Step:  AIC=-1864.15
## log(price) ~ y + clarity + color + z
## 
##         Df Sum of Sq     RSS     AIC
## + carat  1    3.2282  8.0436 -2030.9
## + x      1    0.2556 11.0162 -1873.6
## + depth  1    0.0507 11.2212 -1864.4
## + cut    4    0.1839 11.0880 -1864.4
## <none>               11.2719 -1864.2
## + table  1    0.0305 11.2414 -1863.5
## 
## Step:  AIC=-2030.86
## log(price) ~ y + clarity + color + z + carat
## 
##         Df Sum of Sq    RSS     AIC
## + x      1  0.316138 7.7275 -2048.9
## + depth  1  0.106402 7.9372 -2035.5
## + cut    4  0.197591 7.8461 -2035.3
## <none>               8.0436 -2030.9
## + table  1  0.005318 8.0383 -2029.2
## 
## Step:  AIC=-2048.91
## log(price) ~ y + clarity + color + z + carat + x
## 
##         Df Sum of Sq    RSS     AIC
## + cut    4  0.243656 7.4838 -2056.9
## <none>               7.7275 -2048.9
## + depth  1  0.007798 7.7197 -2047.4
## + table  1  0.000089 7.7274 -2046.9
## 
## Step:  AIC=-2056.93
## log(price) ~ y + clarity + color + z + carat + x + cut
## 
##         Df Sum of Sq    RSS     AIC
## + table  1  0.087090 7.3968 -2060.8
## <none>               7.4838 -2056.9
## + depth  1  0.003759 7.4801 -2055.2
## 
## Step:  AIC=-2060.78
## log(price) ~ y + clarity + color + z + carat + x + cut + table
## 
##         Df Sum of Sq    RSS     AIC
## <none>               7.3968 -2060.8
## + depth  1  0.012125 7.3846 -2059.6
## 
## Call:
## lm(formula = log(price) ~ y + clarity + color + z + carat + x + 
##     cut + table, data = DiamondData)
## 
## Coefficients:
##  (Intercept)             y     clarityIF    claritySI1    claritySI2  
##    -0.646368      0.021823      1.174667      0.724791      0.541568  
##   clarityVS1    clarityVS2   clarityVVS1   clarityVVS2        colorE  
##     0.930980      0.862124      1.139110      1.030157     -0.050855  
##       colorF        colorG        colorH        colorI        colorJ  
##    -0.112033     -0.143824     -0.276638     -0.404476     -0.556470  
##            z         carat             x       cutGood      cutIdeal  
##     1.122292     -0.961953      0.666978      0.122646      0.184874  
##   cutPremium  cutVery Good         table  
##     0.136072      0.161631      0.008273
forward_model <- lm(formula = log(price) ~ y + clarity + color + z + carat + 
    x + cut + table, data = DiamondData)
summary(forward_model)
## 
## Call:
## lm(formula = log(price) ~ y + clarity + color + z + carat + x + 
##     cut + table, data = DiamondData)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.37802 -0.08234 -0.00160  0.08539  0.37902 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -0.646368   0.240365  -2.689 0.007415 ** 
## y             0.021823   0.136741   0.160 0.873271    
## clarityIF     1.174667   0.050390  23.312  < 2e-16 ***
## claritySI1    0.724791   0.043238  16.763  < 2e-16 ***
## claritySI2    0.541568   0.044007  12.306  < 2e-16 ***
## clarityVS1    0.930980   0.044749  20.805  < 2e-16 ***
## clarityVS2    0.862124   0.044291  19.465  < 2e-16 ***
## clarityVVS1   1.139110   0.049155  23.174  < 2e-16 ***
## clarityVVS2   1.030157   0.047127  21.859  < 2e-16 ***
## colorE       -0.050855   0.021945  -2.317 0.020906 *  
## colorF       -0.112033   0.021570  -5.194 3.06e-07 ***
## colorG       -0.143824   0.021619  -6.653 7.90e-11 ***
## colorH       -0.276638   0.022372 -12.366  < 2e-16 ***
## colorI       -0.404476   0.026642 -15.182  < 2e-16 ***
## colorJ       -0.556470   0.030626 -18.170  < 2e-16 ***
## z             1.122292   0.088587  12.669  < 2e-16 ***
## carat        -0.961953   0.066050 -14.564  < 2e-16 ***
## x             0.666978   0.140528   4.746 2.74e-06 ***
## cutGood       0.122646   0.041184   2.978 0.003049 ** 
## cutIdeal      0.184874   0.042654   4.334 1.78e-05 ***
## cutPremium    0.136072   0.040220   3.383 0.000776 ***
## cutVery Good  0.161631   0.040571   3.984 7.84e-05 ***
## table         0.008273   0.003491   2.370 0.018192 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1245 on 477 degrees of freedom
## Multiple R-squared:  0.9857, Adjusted R-squared:  0.9851 
## F-statistic:  1498 on 22 and 477 DF,  p-value: < 2.2e-16

Forward selection produced a slightly different model that excludes only depth (it retains y). The \(R^2_{adj}\) of this model is also 0.9851, with a final AIC of -2060.78, and it looks like this:

\[ log(price) = \beta_0 + \beta_1y + \beta_2clarity + \beta_3color + \beta_4z + \beta_5carat + \beta_6x + \beta_7cut + \beta_8table + \epsilon \]

AIC Stepwise Regression

null_model <- lm(log(price)~1, DiamondData)
step(null_model, direction='both', scope = formula(full_model))
## Start:  AIC=19.96
## log(price) ~ 1
## 
##           Df Sum of Sq    RSS      AIC
## + y        1    483.63  34.66 -1330.57
## + x        1    482.36  35.92 -1312.62
## + z        1    478.09  40.19 -1256.47
## + carat    1    451.09  67.19  -999.57
## + cut      4     31.36 486.93    -3.25
## + clarity  7     37.02 481.26    -3.10
## + table    1     22.95 495.33    -0.69
## <none>                 518.28    19.96
## + color    6     11.54 506.74    20.69
## + depth    1      0.41 517.87    21.56
## 
## Step:  AIC=-1330.57
## log(price) ~ y
## 
##           Df Sum of Sq    RSS      AIC
## + clarity  7     11.79  22.87 -1524.38
## + color    6      6.83  27.83 -1428.24
## + carat    1      2.86  31.79 -1371.66
## + depth    1      0.15  34.51 -1330.69
## + z        1      0.14  34.52 -1330.58
## <none>                  34.66 -1330.57
## + table    1      0.07  34.59 -1329.57
## + x        1      0.03  34.63 -1329.01
## + cut      4      0.43  34.23 -1328.81
## - y        1    483.63 518.28    19.96
## 
## Step:  AIC=-1524.38
## log(price) ~ y + clarity
## 
##           Df Sum of Sq    RSS     AIC
## + color    6      9.78  13.09 -1791.5
## + carat    1      3.12  19.75 -1595.7
## + z        1      1.03  21.84 -1545.5
## + depth    1      0.87  22.00 -1541.8
## + x        1      0.33  22.54 -1529.6
## + cut      4      0.39  22.48 -1525.0
## <none>                  22.87 -1524.4
## + table    1      0.00  22.87 -1522.4
## - clarity  7     11.79  34.66 -1330.6
## - y        1    458.39 481.26    -3.1
## 
## Step:  AIC=-1791.49
## log(price) ~ y + clarity + color
## 
##           Df Sum of Sq    RSS      AIC
## + z        1      1.82  11.27 -1864.15
## + carat    1      1.78  11.31 -1862.57
## + depth    1      1.57  11.52 -1853.19
## + x        1      0.64  12.44 -1814.69
## <none>                  13.09 -1791.49
## + table    1      0.02  13.07 -1790.25
## + cut      4      0.15  12.94 -1789.25
## - color    6      9.78  22.87 -1524.38
## - clarity  7     14.74  27.83 -1428.24
## - y        1    455.40 468.49    -4.55
## 
## Step:  AIC=-1864.15
## log(price) ~ y + clarity + color + z
## 
##           Df Sum of Sq     RSS     AIC
## + carat    1    3.2282  8.0436 -2030.9
## + x        1    0.2556 11.0162 -1873.6
## + depth    1    0.0507 11.2212 -1864.4
## + cut      4    0.1839 11.0880 -1864.4
## <none>                 11.2719 -1864.2
## + table    1    0.0305 11.2414 -1863.5
## - z        1    1.8152 13.0871 -1791.5
## - y        1    2.7573 14.0291 -1756.7
## - color    6   10.5663 21.8382 -1545.5
## - clarity  7   16.2254 27.4973 -1432.3
## 
## Step:  AIC=-2030.86
## log(price) ~ y + clarity + color + z + carat
## 
##           Df Sum of Sq     RSS     AIC
## + x        1    0.3161  7.7275 -2048.9
## + depth    1    0.1064  7.9372 -2035.5
## + cut      4    0.1976  7.8461 -2035.3
## <none>                  8.0436 -2030.9
## + table    1    0.0053  8.0383 -2029.2
## - carat    1    3.2282 11.2719 -1864.2
## - z        1    3.2640 11.3076 -1862.6
## - y        1    4.7021 12.7457 -1802.7
## - color    6    9.1056 17.1493 -1664.3
## - clarity  7   17.1175 25.1611 -1474.7
## 
## Step:  AIC=-2048.91
## log(price) ~ y + clarity + color + z + carat + x
## 
##           Df Sum of Sq     RSS     AIC
## + cut      4    0.2437  7.4838 -2056.9
## <none>                  7.7275 -2048.9
## + depth    1    0.0078  7.7197 -2047.4
## + table    1    0.0001  7.7274 -2046.9
## - y        1    0.0713  7.7988 -2046.3
## - x        1    0.3161  8.0436 -2030.9
## - z        1    2.7089 10.4365 -1900.7
## - carat    1    3.2887 11.0162 -1873.6
## - color    6    9.2518 16.9793 -1667.3
## - clarity  7   17.3953 25.1228 -1473.4
## 
## Step:  AIC=-2056.93
## log(price) ~ y + clarity + color + z + carat + x + cut
## 
##           Df Sum of Sq     RSS     AIC
## + table    1    0.0871  7.3968 -2060.8
## - y        1    0.0037  7.4876 -2058.7
## <none>                  7.4838 -2056.9
## + depth    1    0.0038  7.4801 -2055.2
## - cut      4    0.2437  7.7275 -2048.9
## - x        1    0.3622  7.8461 -2035.3
## - z        1    2.5361 10.0199 -1913.0
## - carat    1    3.3531 10.8370 -1873.8
## - color    6    9.1841 16.6679 -1668.6
## - clarity  7   16.9391 24.4229 -1479.5
## 
## Step:  AIC=-2060.78
## log(price) ~ y + clarity + color + z + carat + x + cut + table
## 
##           Df Sum of Sq     RSS     AIC
## - y        1    0.0004  7.3972 -2062.8
## <none>                  7.3968 -2060.8
## + depth    1    0.0121  7.3846 -2059.6
## - table    1    0.0871  7.4838 -2056.9
## - cut      4    0.3307  7.7274 -2046.9
## - x        1    0.3493  7.7461 -2039.7
## - z        1    2.4888  9.8856 -1917.8
## - carat    1    3.2891 10.6859 -1878.8
## - color    6    9.1678 16.5646 -1669.7
## - clarity  7   17.0262 24.4229 -1477.5
## 
## Step:  AIC=-2062.76
## log(price) ~ clarity + color + z + carat + x + cut + table
## 
##           Df Sum of Sq     RSS     AIC
## <none>                  7.3972 -2062.8
## + depth    1    0.0047  7.3924 -2061.1
## + y        1    0.0004  7.3968 -2060.8
## - table    1    0.0904  7.4876 -2058.7
## - cut      4    0.4016  7.7987 -2044.3
## - x        1    2.3920  9.7891 -1924.7
## - z        1    2.5299  9.9270 -1917.7
## - carat    1    3.2959 10.6931 -1880.5
## - color    6    9.1986 16.5957 -1670.7
## - clarity  7   17.7297 25.1269 -1465.3
## 
## Call:
## lm(formula = log(price) ~ clarity + color + z + carat + x + cut + 
##     table, data = DiamondData)
## 
## Coefficients:
##  (Intercept)     clarityIF    claritySI1    claritySI2    clarityVS1  
##     -0.65205       1.17546       0.72515       0.54168       0.93156  
##   clarityVS2   clarityVVS1   clarityVVS2        colorE        colorF  
##      0.86275       1.13982       1.03087      -0.05103      -0.11215  
##       colorG        colorH        colorI        colorJ             z  
##     -0.14390      -0.27686      -0.40480      -0.55660       1.12393  
##        carat             x       cutGood      cutIdeal    cutPremium  
##     -0.96135       0.68759       0.12432       0.18682       0.13700  
## cutVery Good         table  
##      0.16381       0.00835
stepWise_model <- lm(formula = log(price) ~ clarity + color + z + carat + x + 
    cut + table, data = DiamondData)
summary(stepWise_model) 
## 
## Call:
## lm(formula = log(price) ~ clarity + color + z + carat + x + cut + 
##     table, data = DiamondData)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.37845 -0.08277 -0.00156  0.08560  0.37921 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -0.652054   0.237467  -2.746 0.006263 ** 
## clarityIF     1.175455   0.050096  23.464  < 2e-16 ***
## claritySI1    0.725152   0.043135  16.811  < 2e-16 ***
## claritySI2    0.541680   0.043956  12.323  < 2e-16 ***
## clarityVS1    0.931561   0.044555  20.908  < 2e-16 ***
## clarityVS2    0.862750   0.044072  19.576  < 2e-16 ***
## clarityVVS1   1.139818   0.048904  23.307  < 2e-16 ***
## clarityVVS2   1.030875   0.046863  21.997  < 2e-16 ***
## colorE       -0.051031   0.021895  -2.331 0.020183 *  
## colorF       -0.112154   0.021535  -5.208 2.84e-07 ***
## colorG       -0.143904   0.021591  -6.665 7.31e-11 ***
## colorH       -0.276855   0.022307 -12.411  < 2e-16 ***
## colorI       -0.404797   0.026539 -15.253  < 2e-16 ***
## colorJ       -0.556602   0.030584 -18.199  < 2e-16 ***
## z             1.123926   0.087903  12.786  < 2e-16 ***
## carat        -0.961345   0.065873 -14.594  < 2e-16 ***
## x             0.687591   0.055306  12.433  < 2e-16 ***
## cutGood       0.124323   0.039780   3.125 0.001885 ** 
## cutIdeal      0.186822   0.040829   4.576 6.06e-06 ***
## cutPremium    0.137005   0.039753   3.446 0.000618 ***
## cutVery Good  0.163807   0.038172   4.291 2.15e-05 ***
## table         0.008350   0.003455   2.417 0.016024 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1244 on 478 degrees of freedom
## Multiple R-squared:  0.9857, Adjusted R-squared:  0.9851 
## F-statistic:  1572 on 21 and 478 DF,  p-value: < 2.2e-16

Stepwise regression also removed depth and y from the model, and it likewise has an \(R^2_{adj}\) of 0.9851 and AIC = -2062.76. The suggested model looks like this:

\[ log(price) = \beta_0 + \beta_1clarity + \beta_2color + \beta_3z + \beta_4carat + \beta_5x + \beta_6cut + \beta_7table + \epsilon \]

Comparing The Models

All three models created using AIC had the same \(R^2_{adj}\) value of 0.9851; however, backward elimination and stepwise regression produced the same model, which has one fewer predictor (y) than the forward-selection model. That model is not far off from the one I created, except that it includes table and does not take the log of carat, and my model had a slightly higher \(R^2_{adj}\). To make a decision, I will compare the AIC and BIC values of the three candidates: my model, the forward-selection model, and the stepwise/backward model.

AIC(forward_model)
## [1] -639.8445
AIC(stepWise_model)
## [1] -641.8178
AIC(lm2)
## [1] -651.7323
BIC(forward_model)
## [1] -538.6939
BIC(stepWise_model)
## [1] -544.8818
BIC(lm2)
## [1] -563.2255

Out of the three models, the one I created, where I removed table, y, and depth and took the log of carat, has the lowest AIC and BIC values, which is what we want in a model. (Note that AIC() reports values on a different scale than step() did above, because step() uses extractAIC(), which drops additive constants from the log-likelihood; comparisons within each method are still valid.) And as stated before, my model had the highest \(R^2_{adj}\).

Confidence Interval Estimates

Next I will look at the confidence interval for each coefficient in my model, using the formula:

\[\hat{\beta}_j - t(\alpha / 2,\, n-p)\, \operatorname{se}\!\left(\hat{\beta}_j\right) \leq \beta_j \leq \hat{\beta}_j + t(\alpha / 2,\, n-p)\, \operatorname{se}\!\left(\hat{\beta}_j\right)\]

where \(p\) is the number of estimated coefficients (here \(n - p = 480\) residual degrees of freedom).

confint(lm2)
##                    2.5 %      97.5 %
## (Intercept)   6.19881666  7.38301882
## log(carat)    1.43109278  1.76482495
## z             0.10327460  0.38810932
## cutGood       0.05537115  0.20616803
## cutIdeal      0.13700501  0.27305171
## cutPremium    0.09573828  0.23406985
## cutVery Good  0.10132021  0.23890869
## clarityIF     1.01100100  1.20604196
## claritySI1    0.57535297  0.74397284
## claritySI2    0.39543555  0.56679837
## clarityVS1    0.78499774  0.95874552
## clarityVS2    0.71492530  0.88684046
## clarityVVS1   0.98527364  1.17566209
## clarityVVS2   0.87752534  1.05996743
## colorE       -0.10642689 -0.02132536
## colorF       -0.15639655 -0.07257967
## colorG       -0.18689257 -0.10323727
## colorH       -0.32131684 -0.23494410
## colorI       -0.44494772 -0.34151972
## colorJ       -0.61578441 -0.49688911
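The confint() output can be checked by hand from the coefficient table. A minimal sketch (assuming lm2 has been fit as above) using the estimate, its standard error, and the t quantile at the model's residual degrees of freedom:

```r
# Reconstruct the 95% CI for one coefficient, e.g. log(carat), by hand.
# Assumes lm2 exists as fit above (480 residual degrees of freedom).
est   <- coef(lm2)["log(carat)"]
se    <- sqrt(diag(vcov(lm2)))["log(carat)"]
tcrit <- qt(0.975, df = df.residual(lm2))
c(lower = est - tcrit * se, upper = est + tcrit * se)
# Should match the log(carat) row of confint(lm2): (1.431, 1.765)
```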
head(exp(predict(lm2, interval = "confidence", level = .95)))
##         fit       lwr       upr
## 1 2543.8621 2420.2871 2673.7466
## 2 3026.5532 2914.9362 3142.4442
## 3  772.6731  738.3613  808.5794
## 4  729.6803  691.7040  769.7416
## 5  922.9210  873.1527  975.5261
## 6 3449.5042 3290.9238 3615.7262
head(exp(predict(lm2, interval = "prediction", level = .95)))
## Warning in predict.lm(lm2, interval = "prediction", level = 0.95): predictions on current data refer to _future_ responses
##         fit       lwr       upr
## 1 2543.8621 1986.0277 3258.3809
## 2 3026.5532 2367.9951 3868.2615
## 3  772.6731  603.7450  988.8675
## 4  729.6803  569.2384  935.3433
## 5  922.9210  719.6764 1183.5642
## 6 3449.5042 2694.5177 4416.0332

After calculating the confidence intervals and the prediction intervals, I can see their ranges differ a lot: the prediction interval is much wider. This is because the prediction interval depends on both the error from the fitted model and the error associated with a future observation.
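The difference is visible in the interval formulas: at a new point \(x_0\), the confidence interval for the mean response and the prediction interval for a single new observation are

\[ \hat{y}_0 \pm t(\alpha/2,\, n-p)\,\hat{\sigma}\sqrt{x_0^T(X^TX)^{-1}x_0} \quad \text{vs.} \quad \hat{y}_0 \pm t(\alpha/2,\, n-p)\,\hat{\sigma}\sqrt{1 + x_0^T(X^TX)^{-1}x_0} \]

The extra 1 under the square root is the variance of the new observation's own error term, which is why the prediction interval is always wider.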

Final Model

\[ log(price) = \beta_0 + \beta_1log(carat) + \beta_2color + \beta_3clarity + \beta_4cut + \beta_5z + \epsilon \]

Throughout this analysis I investigated the effect of 9 variables on one response variable. From the beginning I was able to rule out two of the three dimension variables (x, y, and z) due to multicollinearity, which makes sense given the round shape of a diamond: when one dimension grows, so do the others. I was also able to rule out depth because it was not significant and had little correlation with price. Later in my analysis I noticed I needed to transform some of my variables, taking the log of price and carat, to satisfy the constant-variance error assumption. After putting my model together I noticed that table was hurting my model because it was less significant than the other variables and contributed to overfitting. Removing table improved the fit, and I was able to create a model in which 98.53% of the variance in the response is explained by the predictors.